1 Introduction
1) Why this topic
With the development of Internet and signal technologies, smartphones play an essential role in people’s lives. Ten years ago, people used mobile phones only to text and make calls; today smartphones are used for browsing the internet, sending email, video chatting with friends, and even paying bills. We therefore want to see how popular mobile usage is in different countries and how mobile internet requests are distributed over the course of a day. In addition, mobile operators can record users’ locations when they make requests, although with some signal uncertainty, so it is also worth studying the geo-location information to see the spatial patterns of signal accuracy.
In this project, we focus on two specific regions. The Middle East countries are geographically close to each other but have very different national conditions. Russia has a vast territory, so mobile signal quality may vary across the country. Because studying these two regions may produce interesting results, we used mobile request data from them to examine mobile usage and signal geo-location information in the Middle East and Russia.
We used data visualization and statistical methods to understand mobile popularity and signal accuracy in different countries and to look for the underlying reasons for the differences. In addition, we developed a mobile signal tracking site called Mobile Signal. It lists mobile usage records over three consecutive days in both the Middle East and Russia, along with the tools and information needed to understand the variation across countries and areas.
2) Research Questions
From the data, we try to answer four questions about the popularity of mobile usage:
a. Number of mobile users in different countries
b. Prevalence of mobile internet usage in different countries
c. Development of mobile industry in different countries
d. Usage volume of mobile across time in one day
With this data, we can also study further topics using the geo-location information:
a. Patterns of signal accuracy in different Mideast countries across time
b. Movement patterns in the mobile internet requests of specific groups of people
3) How to find the data
The data came from the company where Zhirui Wang interned. It covers three days in 2016 from Middle East countries and Russia. The original data is 17.6 GB, and we uploaded it to Google Drive. We also used world population data and GDP data from the World Bank.
2 Team members and distributed contributions
We have four group members: Zhirui Wang (zw2389), Xikai Chen (xc2358), Yaqing Wang (yw2902) and Chang Pan (cp2923). To start, Zhirui did all of the data cleaning and plotted bar charts to visualize the number of records and the accuracy. Yaqing then analyzed the bar charts to get a basic idea of the dataset. Next, Zhirui made spatial visualizations of the geo-location data in R to generate two maps: the number of records in the Middle East and in Russia. After these basic steps, Xikai created a Shiny app and embedded Zhirui’s plots and maps in it. He and Chang also used animations to visualize how the dots on the maps change over time. In addition, they optimized the app by adding filters that capture mobile usage at specific landmarks in Russia. Finally, for the report, Yaqing is responsible for the Introduction, the Team section and the Middle East part of the Main Analysis; Zhirui is responsible for the Analysis of Data Quality and the Russia part of the Main Analysis; and Chang and Xikai are responsible for the Executive Summary.
3 Data Quality
This is a data set of mobile signal records. It consists of two parts: the first is from 22 countries in the Middle East during Dec 10-13, 2016, and the second is from Russia during Dec 10-12. First, let us look at the data quality of the Mideast data:
The original Mideast data is 1.93 gigabytes; it has no column names and is tab-delimited. The columns are: Timestamp, IP Address, User ID, Latitude, Longitude, Accuracy, Country. The timestamp has a precision of one second, but since we mostly analyze the data by the hour, we convert it to hours later. The IP Address carries little information for our purposes, so we drop it to reduce memory consumption. The User ID is unique to each mobile phone; because the Mideast analysis is mainly at the country level, the User ID is not informative either, and we drop it as well. For the same reason, latitude and longitude are unnecessary as long as we have the country column, so we drop them too. The accuracy measures, like a ‘confidence interval’, how far the recorded location might differ from the actual location. When we open Google Maps, a semi-transparent sky-blue circle appears around our location, and the radius of that circle corresponds to the accuracy here. It might be better called ‘inaccuracy’, because the larger the number, the less confident we are about the actual location, but the company that provided the data calls it ‘accuracy’, so we keep that name. The final column is the country of the mobile phone, given as an ISO two-letter code, so we web-scrape a code book to convert each code into the country name. The data cleaning function is as follows:
library(rvest)

# Scrape the ISO country code book (the first table on the page).
codebook <- 'http://www.worldatlas.com/aatlas/ctycodes.htm' %>%
  read_html %>%
  html_nodes('table') %>%
  .[[1]] %>%
  html_table()

# Read one raw tab-delimited Mideast file, name its columns, map the ISO
# two-letter code to a country name, and keep only the columns we analyze.
CleanData_Mideast <- function(input,output){
  library(tidyverse)
  x <- read_delim(input,
                  "\t", escape_double = FALSE, col_names = FALSE,
                  trim_ws = TRUE)
  colnames(x) <- c('Timestamp','IP Address','User ID','Latitude','Longitude','Accuracy','Country')
  x$COUNTRY <- codebook$COUNTRY[match(x$Country,codebook$`A2 (ISO)`)]
  x <- x %>% select(Timestamp,Accuracy,COUNTRY)
  write_csv(x,output)
}
CleanData_Mideast("C:/Users/wang_/Desktop/2016-12-11a.txt","C:/Users/wang_/Desktop/2016-12-11a_new.csv")
CleanData_Mideast("C:/Users/wang_/Desktop/2016-12-12a.txt","C:/Users/wang_/Desktop/2016-12-12a_new.csv")
CleanData_Mideast("C:/Users/wang_/Desktop/2016-12-13a.txt","C:/Users/wang_/Desktop/2016-12-13a_new.csv")
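The hourly conversion mentioned above is not part of the cleaning function itself. A minimal sketch of one way to do it is shown below, written in base R so that it runs without extra packages. The toy `records` data frame, its values and the `Hour` column name are illustrative assumptions, and the timestamps are assumed to be in UTC.

```r
# Toy records standing in for the cleaned Mideast data (values are made up).
records <- data.frame(
  Timestamp = as.POSIXct(c("2016-12-11 00:15:42",
                           "2016-12-11 00:59:59",
                           "2016-12-11 13:05:10"), tz = "UTC"),
  Accuracy = c(25, 1200, 60)
)

# Truncate each timestamp to the start of its hour (assumes UTC).
records$Hour <- as.POSIXct(trunc(records$Timestamp, units = "hours"))

# Count the records that fall into each hour.
hour_label <- format(records$Hour, "%Y-%m-%d %H:00", tz = "UTC")
hourly <- as.data.frame(table(Hour = hour_label),
                        responseName = "count",
                        stringsAsFactors = FALSE)
print(hourly)
```

The same idea carries over to the tidyverse style used in this report, where the hour column would be created inside a pipeline before grouping.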
Then let us look at the data of Russia:
The original Russia data is 15.6 gigabytes; it is also tab-delimited and has no column names. The columns are Time Stamp, IP Address, User ID, Latitude, Longitude, Accuracy, Wi-Fi Networks Nearby, GSM Towers, Country. The Time Stamp, IP Address and Accuracy have the same meaning as in the Mideast data, so we process them in the same way. In this part of the analysis we focus on the individual level, so the User ID, Latitude and Longitude are crucial and, unlike in the previous part, we keep them. The Wi-Fi Networks Nearby, GSM Towers and Country columns do not provide useful information for our analysis, so we drop them. The code for cleaning the data is as follows:
# Read one raw tab-delimited Russia file, name its columns, and keep the
# individual-level columns (user, location, accuracy) used in the analysis.
CleanData_Russia <- function(input,output){
  library(tidyverse)
  x <- read_delim(input,
                  "\t", escape_double = FALSE, col_names = FALSE,
                  trim_ws = TRUE)
  colnames(x) <- c('Time Stamp','IP Address','User ID','Latitude','Longitude','Accuracy','Wifi Networks Nearby','GSM Towers','Country')
  x <- x %>% select(`Time Stamp`,`User ID`,`Latitude`,`Longitude`,`Accuracy`)
  write_csv(x,output)
}
CleanData_Russia("C:/Users/wang_/Desktop/2016-12-10b.txt","C:/Users/wang_/Desktop/2016-12-10b_new.csv")
CleanData_Russia("C:/Users/wang_/Desktop/2016-12-11b.txt","C:/Users/wang_/Desktop/2016-12-11b_new.csv")
CleanData_Russia("C:/Users/wang_/Desktop/2016-12-12b.txt","C:/Users/wang_/Desktop/2016-12-12b_new.csv")
The data are records from a telecom company, so by their nature they have no missing values or outliers. However, each day’s file contains some records that do not belong to that day. During the analysis, we clean the data by dropping all of these rows.
We have integrated the first 1000 rows of data from each day in the Mideast and Russia into our Shiny app under the data tab. The code of the Shiny app is on GitHub.
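As an illustration of dropping rows that do not belong to a file’s day, here is a minimal base R sketch. The helper name `keep_day`, the toy data and the assumption that timestamps are interpreted in UTC are ours and are not taken from the project code.

```r
# Hypothetical helper: keep only the rows whose calendar date (in the given
# time zone, UTC by assumption) equals the day the file is supposed to cover.
keep_day <- function(df, day, tz = "UTC") {
  df[as.Date(df$Timestamp, tz = tz) == as.Date(day), , drop = FALSE]
}

# Toy file for 2016-12-11 with one stray row before and one after that day.
day_file <- data.frame(
  Timestamp = as.POSIXct(c("2016-12-10 23:59:58",
                           "2016-12-11 00:00:03",
                           "2016-12-11 18:30:00",
                           "2016-12-12 00:00:01"), tz = "UTC"),
  Accuracy = c(40, 15, 900, 65)
)

cleaned <- keep_day(day_file, "2016-12-11")
print(cleaned)
```

If the timestamps were recorded in a local time zone instead, the `tz` argument would need to change accordingly, since rows near midnight move from one day to another.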
4 Executive Summary
Mobile usage is an appealing topic. At the micro level, studying a person’s internet request records can tell us about that person’s living circle, lifestyle and so on. At the macro level, analyzing people’s mobile request data and mobile signal accuracy reveals a lot about a country’s development, its level of wealth, and even its infrastructure.
In this project, we focus on two specific regions. The Middle East countries are geographically close to each other but have very different national conditions. Russia has a vast territory, so mobile signal quality may vary across the country. The study of these two regions yields some interesting results, and we present the most revealing findings in this short summary.
First, we took a look at the Middle East countries. Here is a plot showing the number of mobile usage records in different countries.
From the plot above we can see that Turkey, Saudi Arabia and the United Arab Emirates have the highest numbers of records, approaching 1.5 million. All three countries rank at the top of the Middle East total GDP ranking, so it makes sense that they have the largest numbers of active mobile users. In general, smaller or poorer countries tend to have fewer records. However, the number of records in a small country like Cyprus is more than four times that in a large country like Iran, which is unexpected.
Given the uniqueness of Cyprus, however, this makes sense. As a small island country with great tourism resources, Cyprus has tourists from nearby countries all year round. People love to go there for short breaks. For Europeans in particular, Cyprus is just a short flight away and has much lower living expenses than most European countries and other Mideast tourist destinations such as the United Arab Emirates. It is therefore quite possible that a large portion of the records comes from foreign tourists.
In addition, since Cyprus is a small island country without much potential for agriculture or industry, it is reasonable to suggest that the mobile industry is relatively more developed in Cyprus than in other Middle East countries, which is consistent with the earlier reasoning that Cyprus has prosperous tourism. However, there are other possible explanations for this pattern. For example, the data might come from a single carrier, which could be based in Cyprus. In that case, the huge volume of records in Cyprus would make more sense, since people in other countries may use other major carriers whose records do not appear in this dataset.
Besides the number of records, the other interesting feature of this data set is the accuracy. It measures, like a ‘confidence interval’, how far the recorded location might differ from the actual one. So the larger the value, the less confidence we have in the location.
From the plot above, four countries have a mean accuracy over 2000 meters: Libya, Iraq, South Sudan and the Democratic Republic of the Congo. All of them are in upheaval or have experienced major turbulence. From an internet search, we learned that smoke from fighting may also affect the mobile signal and hence the accuracy. Additionally, in unstable countries like these, base stations are easily damaged, and there is little money, manpower, resources or motivation to develop the mobile industry, in terms of both mobile phones and base stations. So they are expected to have the largest accuracy values, that is, the least precise locations.
On the other hand, Cyprus again beats the other countries, with the best accuracy in the Middle East. Besides its stable political situation, the majority of the country is plain, which is favorable for base stations. Furthermore, as a popular tourist destination, it has the motivation to build a better environment for internet users, for example high-quality infrastructure, in order to attract more tourists. In turn, tourists, who probably come from richer countries, tend to use high-quality cell phones. All of these factors could lead to better accuracy.
Now we move to Russia. Here we took a closer look at the movement patterns of specific groups and found some interesting patterns of tourists in Moscow on the map.
Take this plot as an example. The blue dots represent the positions of a certain person and clearly reveal that person’s movement pattern. Since the dots line up almost perfectly, it seems that the person was queuing for a tourist sight or perhaps a restaurant. However, similar line patterns did not appear across the map as often as we expected. People apparently did not have to line up to buy tickets or to enter the sights, so we conclude that this period was probably not a busy travel season in Moscow, and tourists could go wherever they wanted without waiting in line.
Another finding is that many of the clusters of internet requests are on bridges or at the waterfront. This may be because bridges and waterfronts are great places for photography: people take photos there with their mobile phones and then upload them to social media, which generates internet requests.
To sum up, in the Middle East, the larger or richer a country is, the more mobile usage records and the better accuracy it has, with one exception: Cyprus. As a major tourist destination, it attracts many foreign visitors, who contribute heavily to the country’s mobile usage records and in turn bring high-tech smartphones that motivate Cyprus to build better base stations. Also, as expected, countries in upheaval or at war have fewer records and larger accuracy values. As for Russia, based on the internet requests shown on the map, we conclude that it was most likely an off-season for tourism in Moscow, and that when people visit natural sights such as ponds, they tend to use their cell phones for internet access more often.
5 Main Analysis
Middle East
library(plotly)
library(tidyverse)
library(gganimate)
library(lubridate)
library(forcats)
library(biglm)
library(lmtest)
library(knitr)
library(leaflet)
X2016_12_11a_new <- read_csv("C:/Users/wang_/Desktop/2016-12-11a_new.csv",progress=F)
X2016_12_12a_new <- read_csv("C:/Users/wang_/Desktop/2016-12-12a_new.csv",progress=F)
X2016_12_13a_new <- read_csv("C:/Users/wang_/Desktop/2016-12-13a_new.csv",progress=F)
x <- bind_rows(X2016_12_11a_new,X2016_12_12a_new,X2016_12_13a_new)
Number of records
First we took a look at the number of records in each Middle East country. For the purpose of comparison, we had two options: bar charts and pie charts. However, with more than 20 countries in total, a pie chart makes it difficult to identify the slice of a country with a small share, or to compare it with another small slice. Therefore, we settled on bar charts.
# Count the records per country and sort the countries by that count.
x_group_count <- x %>%
  filter(Timestamp>as.Date('2016-12-12 00:00:00 UTC')) %>%
  group_by(COUNTRY) %>%
  summarise(count=n()) %>%
  arrange(count)

# Horizontal bar chart; as_factor() keeps the sorted order of the countries.
(x_group_count %>%
  ggplot(aes(y=count,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records')+
  xlab('Country Name')+
  ggtitle('Number of Records by Country')) %>%
  ggplotly
To draw this graph, we first filter the data to drop the rows that do not belong to these three days. Then we group the data by country and count the records for each country. We want the bar chart to be sorted by the number of records, so we arrange by the count variable and use the as_factor() function to preserve this order when passing the data to ggplot. We also flip the coordinates so that the country names are easier to read.
From the plot above, we can see that Turkey, Saudi Arabia and the United Arab Emirates have the highest numbers of records, approaching 1.5 million. All three countries rank at the top of the Middle East total GDP ranking, so it makes sense that they have the largest numbers of active mobile users. In general, smaller or poorer countries tend to have fewer records. However, the number of records in a small country like Cyprus is more than four times that in a large country like Iran.
There are two obvious factors associated with the number of records: national population and GDP. We explore them one by one, starting with population. By dividing the number of records by the national population, we obtain the number of records relative to the population, which can serve as a measure of how prevalent mobile usage is in each country.
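The filter, count and ordering steps can be sketched on toy data as follows, in base R so that the snippet runs without extra packages. The window bounds, the toy records and the assumption that timestamps are date-times in UTC are illustrative. The bounds are built with `as.POSIXct` so that both sides of each comparison have the same class.

```r
# Toy records (made-up values) with one row before and one after the window.
records <- data.frame(
  Timestamp = as.POSIXct(c("2016-12-10 23:50:00", "2016-12-11 08:00:00",
                           "2016-12-11 09:00:00", "2016-12-12 10:00:00",
                           "2016-12-12 11:00:00", "2016-12-13 12:00:00",
                           "2016-12-13 13:00:00", "2016-12-14 00:00:05"),
                         tz = "UTC"),
  COUNTRY = c("Turkey", "Cyprus", "Turkey", "Iran",
              "Turkey", "Cyprus", "Turkey", "Iran"),
  stringsAsFactors = FALSE
)

# Assumed three-day window: [2016-12-11 00:00, 2016-12-14 00:00) in UTC.
window_start <- as.POSIXct("2016-12-11 00:00:00", tz = "UTC")
window_end   <- as.POSIXct("2016-12-14 00:00:00", tz = "UTC")
in_window <- records[records$Timestamp >= window_start &
                       records$Timestamp < window_end, ]

# Count records per country and sort by count in ascending order.
counts <- as.data.frame(table(COUNTRY = in_window$COUNTRY),
                        responseName = "count", stringsAsFactors = FALSE)
counts <- counts[order(counts$count), ]

# Fix the factor levels in sorted order so a bar chart keeps this ordering.
counts$COUNTRY <- factor(counts$COUNTRY, levels = counts$COUNTRY)
print(counts)
```

Setting the factor levels explicitly plays the same role as `as_factor()` in the plotting code: without it, the plotting library would fall back to alphabetical order.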
# Scrape the population table and keep the country name and population columns.
population <- "http://www.worldometers.info/world-population/population-by-country/" %>%
  read_html %>%
  html_nodes('table') %>%
  .[[1]] %>%
  html_table() %>%
  .[,2:3]
colnames(population)[2] <- 'Population'
population$Population <- population$Population %>% gsub(',','',.) %>% as.numeric()

# Match by country name; the unmatched names are filled in with hard-coded
# row positions in the scraped table.
a <- match(x_group_count$COUNTRY,population$`Country (or dependency)`)
a[a %>% is.na %>% which] <- c(16,61,121,17)
x_group_count$Population <- population$Population[a]

(x_group_count %>%
  mutate(percentage=100*count/Population) %>%
  ggplot(aes(y=percentage,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records/Population')+
  xlab('Country Name')+
  ggtitle('Number of Records to Population by Country')) %>%
  ggplotly
To draw this graph, we first scrape world population information from the Internet and then match the table to our original data by country name. We divide the number of records by the population and multiply by 100 to get a percentage, then draw the graph using the same technique as in the previous part. An alternative is to use the number of unique user IDs as the numerator. However, we want the index to also reflect how many internet requests each user made, so we use the number of records.
The plot shows that Turkey, Saudi Arabia and the United Arab Emirates no longer rank the highest. Instead, Cyprus has a far larger value than the other countries: the number of records over the three days is nearly half of the country’s population. We think this is because Cyprus is a small island country with great tourism resources. People from nearby countries, especially Europeans, love to go there for short breaks, so it is quite possible that a large share of the records comes from foreign tourists. Other than Cyprus, richer countries generally seem to have more records per person than poorer ones.
To confirm this finding, we examine the ratio of the number of records to total GDP, which can serve as a measure of the development of the mobile industry relative to the whole economy in each country. We take the total GDP data from the World Bank data set and divide the number of records by total GDP.
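Matching by country name fails when the two sources spell a name differently. The sketch below shows one possible way to handle this with an explicit alias table and then compute the percentage. All names and numbers are made up for illustration, and the `aliases` vector is a hypothetical device, not the approach used in the project code.

```r
# Toy record counts keyed by code-book names (made-up values).
counts <- data.frame(
  COUNTRY = c("Cyprus", "Iran, Islamic Republic of", "Turkey"),
  count = c(500, 100, 1500),
  stringsAsFactors = FALSE
)

# Toy population table keyed by a different naming convention (made-up values).
population <- data.frame(
  Country = c("Turkey", "Iran", "Cyprus"),
  Population = c(80000, 80000, 1000),
  stringsAsFactors = FALSE
)

# Hypothetical alias table: code-book name -> population-table name.
aliases <- c("Iran, Islamic Republic of" = "Iran")

# Translate only the names that have an alias, then match by name.
lookup_name <- counts$COUNTRY
needs_alias <- lookup_name %in% names(aliases)
lookup_name[needs_alias] <- aliases[lookup_name[needs_alias]]
idx <- match(lookup_name, population$Country)
stopifnot(!anyNA(idx))  # fail loudly if a country is still unmatched

counts$Population <- population$Population[idx]
counts$records_per_100 <- 100 * counts$count / counts$Population
print(counts)
```

Compared with filling in row positions by hand, a name-based alias table keeps working if the scraped table changes its row order, at the cost of having to list each mismatched name once.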
url <- 'http://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv'
temp <- tempfile()
download.file(url, temp, mode="wb")
unzip(temp, "API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv")
totalgdp <- read_csv("API_NY.GDP.MKTP.CD_DS2_en_csv_v2.csv",skip = 4)[,c(1,2,60)]
unlink(temp)

totalgdp$Country <- codebook$COUNTRY[match(totalgdp$`Country Code`,codebook$`A3 (UN)`)]
a <- match(x_group_count$COUNTRY,totalgdp$Country)
x_group_count$gdp <- totalgdp$`2015`[a]

(x_group_count %>%
  mutate(percentage=count/gdp) %>%
  ggplot(aes(y=percentage,x=as_factor(COUNTRY)))+
  geom_bar(stat="identity",fill='skyblue2')+
  coord_flip()+
  ylab('Number of Records/Total GDP')+
  xlab('Country Name')+
  ggtitle('Number of Records to Total GDP by Country')) %>%
  ggplotly
To draw this graph, we first download the GDP data automatically from World Bank Open Data, match it to the original data in the same way as above, and then divide the number of records by total GDP. An alternative is to use GDP per capita, which would give an Engel-coefficient-like index of mobile phone usage. However, since we want to focus on country-level analysis here, we stay with the relative development of the mobile industry.
We can see that Cyprus again surpasses the other countries by a wide margin. Since it is a small island country without much potential for agriculture or industry, it is reasonable to suggest that the mobile industry is relatively more developed in Cyprus than in other Middle East countries, which is consistent with the earlier reasoning that Cyprus has prosperous tourism. However, there are other possible explanations for this pattern. For example, the data might come from a single carrier, which could be based in Cyprus. In that case, the huge volume of records in Cyprus would make more sense, since people in other countries may use other major carriers whose records do not appear in this dataset.
After comparing the data across countries, we can now compare them across time.
First, we want to see how the number of records varies across time, so we plot two animated, interactive graphs that evolve over time.
This is a screenshot of the interactive map (in the Shiny app, it is the Number of Records in Mideast map tab):
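The two ratios discussed above can be compared on toy data as follows. This is a sketch under stated assumptions: every number is made up, the column names `records_per_billion_usd` and `records_per_gdp_per_capita` are ours, and the GDP year is selected by column name rather than by position.

```r
# Toy counts and populations (made-up values).
counts <- data.frame(
  COUNTRY = c("Cyprus", "Turkey"),
  count = c(500, 1500),
  Population = c(1e6, 8e7),
  stringsAsFactors = FALSE
)

# Toy GDP table in current US dollars, one column per year (made-up values).
gdp <- data.frame(
  Country = c("Turkey", "Cyprus"),
  `2014` = c(9e11, 2.3e10),
  `2015` = c(8e11, 2e10),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Pick the year by name so the lookup does not depend on column position.
gdp_year <- "2015"
idx <- match(counts$COUNTRY, gdp$Country)
counts$gdp <- gdp[[gdp_year]][idx]

# Ratio used in the text: records relative to total GDP (per billion USD).
counts$records_per_billion_usd <- counts$count / (counts$gdp / 1e9)

# Alternative mentioned in the text: records relative to GDP per capita.
counts$records_per_gdp_per_capita <-
  counts$count / (counts$gdp / counts$Population)
print(counts)
```

The two indices can rank countries differently: dividing by total GDP favors small economies, while dividing by GDP per capita removes the effect of population size from the denominator.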